Pull main changes for CXI provider into v1.21.x #9932
Closed
Signed-off-by: James Swaro <[email protected]> (cherry picked from commit c1daeeb)
EP objects will be able to support different EP protocols. Currently only the existing Portals SAS implementation is supported: FI_PROTO_CXI. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit b991fd4) (cherry picked from commit 37809dc)
This refactors the EP object ctrl elements related to side-band messaging and MR into their own structure. While this information is exclusively accessed for a standard EP, it will be owned by the SEP (where MRs are bound to the SEP) and shared among TX/RX contexts. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 28c0faa) (cherry picked from commit 366a371)
No functional changes; refactors code to have ep_obj reference the txc and rxc via a pointer. This will allow an ep_obj to support multiple context specializations that implement different endpoint protocols. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 87c50c8) (cherry picked from commit e703d2f)
NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 80f01f7) (cherry picked from commit f975589)
NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 4f1dcf9) (cherry picked from commit d4fe843)
This commit does not alter functionality; it refactors the existing default RXC context into a common base and a protocol-specific derived object. The default protocol is FI_PROTO_CXI, which is implemented by the rxc_hpc derived object. It implements an HPC-capable SAS protocol with unexpected messages buffered at the target, and requires a Portals flow control implementation. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 1552c80) (cherry picked from commit a09b6e4)
This commit does not alter functionality; it refactors the existing default TXC context into a common base and a protocol-specific derived object. The default protocol is FI_PROTO_CXI, which is implemented by the txc_hpc derived object. It implements an HPC-capable SAS protocol with unexpected messages buffered at the target and includes rendezvous messaging. It requires a Portals flow control implementation. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit fb15795) (cherry picked from commit 8699a70)
Refactor so that context allocation is not entangled with EP object initialization. This will allow for contexts to do specialized initialization of structure at calloc. No functional difference. NETCASSINI-5662 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit f368168) (cherry picked from commit f20412b)
Allocation of a TXC/RXC will allocate and initialize the appropriate derived context object. Context initialization is no longer entangled with EP object initialization. Introduces the concept of TXC/RXC ops functions that execute derived-object-specific code. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 5a5661e) (cherry picked from commit b0bc5ad)
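The base/derived context split with an ops table described in the commits above could look roughly like the following. This is a minimal sketch, not the provider's code: every name here (`rxc_base`, `rxc_ops`, `rxc_hpc`, `rxc_cs`, `rxc_alloc`, `PROTO_HPC`, `PROTO_CS`, the `oflow_bufs` field) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names throughout; not the CXI provider's definitions. */
enum { PROTO_HPC = 1, PROTO_CS = 2 };

struct rxc_base;

/* Protocol-specific operations that common code calls through. */
struct rxc_ops {
	int (*init)(struct rxc_base *rxc);
	void (*cleanup)(struct rxc_base *rxc); /* optional, may be NULL */
};

/* Common base shared by all protocol specializations. */
struct rxc_base {
	int protocol;
	const struct rxc_ops *ops;
};

/* Derived objects embed the base as their first member. */
struct rxc_hpc {
	struct rxc_base base;
	int oflow_bufs; /* stand-in for HPC-only resources */
};

struct rxc_cs {
	struct rxc_base base;
};

static int hpc_init(struct rxc_base *rxc)
{
	/* Valid cast: base is the first member of struct rxc_hpc. */
	((struct rxc_hpc *)rxc)->oflow_bufs = 4;
	return 0;
}

static void hpc_cleanup(struct rxc_base *rxc)
{
	((struct rxc_hpc *)rxc)->oflow_bufs = 0;
}

static int cs_init(struct rxc_base *rxc)
{
	(void)rxc; /* nothing protocol-specific to set up in this sketch */
	return 0;
}

static const struct rxc_ops hpc_ops = { .init = hpc_init, .cleanup = hpc_cleanup };
static const struct rxc_ops cs_ops = { .init = cs_init, .cleanup = NULL };

/* Allocate the derived object that matches the protocol, then let it
 * initialize only what it needs. */
struct rxc_base *rxc_alloc(int protocol)
{
	struct rxc_base *rxc;
	const struct rxc_ops *ops;
	size_t size;

	switch (protocol) {
	case PROTO_HPC:
		size = sizeof(struct rxc_hpc);
		ops = &hpc_ops;
		break;
	case PROTO_CS:
		size = sizeof(struct rxc_cs);
		ops = &cs_ops;
		break;
	default:
		return NULL;
	}

	rxc = calloc(1, size);
	if (!rxc)
		return NULL;
	rxc->protocol = protocol;
	rxc->ops = ops;
	if (rxc->ops->init(rxc)) {
		free(rxc);
		return NULL;
	}
	return rxc;
}

/* Common disable path: call into the derived object only if it
 * supports the operation. */
void rxc_disable(struct rxc_base *rxc)
{
	assert(rxc && rxc->ops);
	if (rxc->ops->cleanup)
		rxc->ops->cleanup(rxc);
}
```

In this sketch the common code never inspects the protocol after allocation; it only calls through `ops`, which is the property that lets a new protocol be added without touching existing ones.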
Refactor context initialization to make derived object initialize only what it needs. For example overflow and request buffers are only required for HPC derived object. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit f52389e) (cherry picked from commit 1c63678)
Refactor context disable to call into derived object for cleanup if operation is supported. No new functionality is added; HPC messaging specific cleanup is moved to helper operation. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 5482af1) (cherry picked from commit 4670c82)
Refactors code to allow a derived context to implement protocol-specific progress. This will allow future protocols with different progress demands to not impact existing protocols. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 24ea37c) (cherry picked from commit e03c51f)
Allow RXC/TXC specific cancel functions. This will allow the client/server object to support TX cancel when implemented. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 581a68f) (cherry picked from commit 1268475)
Add RXC op to implement a control messaging callback which can override processing of control messaging events. This allows a context protocol to implement a specific side-band messaging implementation. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 95aa467) (cherry picked from commit 462a4c5)
Refactor code to allow derived RXC/TXC objects to have their own recv_common and send_common functionality. Future protocols will integrate seamlessly into the API flow. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 7cd4f60) (cherry picked from commit 5bd6f39)
Move HPC specific protocol code to new file cxip_msg_hpc.c while leaving common protocol code in cxip_msg.c. This only refactors the code. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit d292b5b) (cherry picked from commit 5250ddf)
Adds the file cxi/src/cxip_msg_cs.c NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit fe0d962) (cherry picked from commit cd9f6ee)
Return fi_info for new protocol, protocol must be explicitly requested if hints are passed. Note that if FI_CXI_COMPAT=2, only old constants are used and new protocol is not present. Update/add unit tests to validate fi_info and selection. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 550516f) (cherry picked from commit 9a9920a)
Add initial FI_PROTO_CXI_CS derived rxc/txc structure initialization and man page update. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 1d10c13) (cherry picked from commit 39b7792)
NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit c6516da) (cherry picked from commit 35ad891)
Refactor RXC code to pull out a common function that completes a put delivered directly to the user buffer. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 3dc0b73) (cherry picked from commit c87db83)
NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 08c60a4) (cherry picked from commit 4718f4d)
This allows the check to be removed from non-tagged messages and the tag bit size to be checked in debug builds only. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit c1f6cab) (cherry picked from commit a39b6f6)
Note that we could remove the tagged bit and use two ptlte. It is a trade-off between hardware resources and match bit use; currently the ptlte resource is considered more valuable. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit ba2cd46) (cherry picked from commit e49de31)
Handle differences in endpoint protocol match bits and avoid unneeded instructions. NETCASSINI-56542 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit f67ba8e) (cherry picked from commit dd86701)
NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 6bd2919) (cherry picked from commit a7400ef)
Does not implement hybrid MR or selective completion yet. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit bfb4bb7) (cherry picked from commit c3ccdd6)
Does not implement hybrid MR, IDC, or no success event operation yet. Commits will follow that add the additional capability. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 14e9fd6) (cherry picked from commit d92e5a3)
With CS protocol, send operation can be cancelled. Allow this status to be returned to the user. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 6fc892a) (cherry picked from commit 7b76af4)
CURL errors should be logged to stderr. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 9abaeef) (cherry picked from commit edae249)
In production, we want to optionally support peer verification. In testing, we generally do not. This can now be specified by setting the environment variable CURLOPT_SSL_VERIFYPEER to 0 (do not verify) or 1 (verify). The default is 0. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit ef30f59) (cherry picked from commit 09dc0f2)
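Reading such a switch from the environment might be sketched as below. The function name `ssl_verifypeer_from_env` is hypothetical, and treating any value other than "1" as 0 is an assumption of this sketch; the commit message only specifies the values 0 and 1 and the default of 0.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 when peer verification is requested,
 * otherwise 0. Unset or unrecognized values fall back to the default
 * of 0 (an assumption; the commit only defines "0" and "1"). */
long ssl_verifypeer_from_env(void)
{
	const char *val = getenv("CURLOPT_SSL_VERIFYPEER");

	if (val && strcmp(val, "1") == 0)
		return 1;
	return 0;
}

/* With libcurl, the result would typically be applied as:
 *   curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, ssl_verifypeer_from_env());
 * The option takes a long, which is why the helper returns one. */
```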
Evaluate the simulation mode once, and set mc_obj->is_multicast appropriately. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 1131419) (cherry picked from commit 817914d)
Allow retries to be disabled for test cases. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 9e05ee0) (cherry picked from commit 5fdece8)
Add COMM_KEY_NONE to _gen_tx_dfa() function. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 5c0bd49) (cherry picked from commit d1b78ad)
Allow CURL operations to be traced independently of JOIN. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 9457dd0) (cherry picked from commit 9c9f861)
Cleanup. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit d383b51) (cherry picked from commit 4643419)
FM now generates a full 6-octet NIC address. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 75d0879) (cherry picked from commit 21d3083)
Add flag to suppress repeated logging during CURL polling. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit c2ea999) (cherry picked from commit 9d6c113)
Remove unused pid_idx value in cxip_join_state structure. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 0ab258f) (cherry picked from commit 73cda5e)
Change minimum test size to 2 (endpoints), from 4. Add "/op" to performance output to clearly indicate that the performance value is per-operation, not a total runtime. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit f215798) (cherry picked from commit fba9ece)
Added SLURM and FI_CXI environment variable capture. Changed error output to stderr (not stdout). Removed placeholder defaults for environment variables. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 80e756f) (cherry picked from commit 40c90a4)
Change cxip_trace_filename to cxip_trace_pathname and allow tracing to occur in alternate directories, which is useful when the current path is not writable by the user. Initialization fails early, without initializing anything, if no masks are selected, which prevents creation of empty files. The early model of initializing only once at test login was flawed; tracing can now be initialized, disabled, and re-initialized. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 7a75110) (cherry picked from commit 9a2acf4)
Checkpoint commit. This code is in development. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit e4354f2) (cherry picked from commit eeffe4b)
Comment is incorrect and misleading for PID_IDX value for mcast address. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 5c533ce) (cherry picked from commit 4ae71d6)
Signed-off-by: Kalyan Kodamagula <[email protected]> (cherry picked from commit 9d1afe8) (cherry picked from commit 763404e)
Note: all CXIP_TRACE* references changed to CXIP_COLL_TRACE*
Note: all cxip_trace* references changed to cxip_coll_trace*

The TRACE() macros produce debugging traces to files that can be on a shared file system, or local to a physical node (and could be memory storage), for debugging collectives, which perform coordinated actions across multiple nodes. This not only prevents implicit synchronization of operations through shared file system waits, but also prevents mangling of the output when using normal character buffering from multiple sources, which is usually faster than line buffering.

This was originally put together for use with bench tests that are part of the libfabric suite, and required initialization through function calls within the bench tests, which makes this feature unavailable to external applications. This commit refactors the TRACE() system to allow it to be entirely configured through environment variables, so that it can be used with production applications.

If the ENABLE_DEBUG flag is zero, all of the TRACE features are removed entirely: embedded TRACE() calls are a syntactically robust NOOP that does not emit code during compilation. Otherwise, individual trace features must be activated through environment variables, allowing different areas of code to be traced selectively. If no trace features are selected, the trace files are not created.

The original design also used function pointer indirection to allow all of the trace functions to be entirely replaced. This was confusing to maintain, and offers no real benefit.

The former cxip_coll_trace_enable() function was overloaded with multiple purposes. This has been simplified into cxip_coll_trace_init() and cxip_coll_trace_close(), which are automatically called during coll module initialization, and a global cxip_coll_trace_muted flag that can be used to temporarily mute tracing. This allows repeated reductions (for instance) to be traced during setup, but then muted during a fast loop.
Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 3cd63c5) (cherry picked from commit ec28d2e)
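The compile-time NOOP and the mute flag described in the commit above can be illustrated with a small sketch. The names `TRACE`, `trace_muted`, `trace_count`, and `reduce_loop` are hypothetical stand-ins rather than the provider's macros, and the sketch writes to stderr instead of the per-node trace files selected through environment variables.

```c
#include <stdio.h>

#ifndef ENABLE_DEBUG
#define ENABLE_DEBUG 1 /* assumed default for this sketch only */
#endif

/* Hypothetical stand-ins for the provider's trace state. */
static int trace_muted; /* nonzero temporarily silences tracing */
static int trace_count; /* counts emitted traces, for illustration */

#if ENABLE_DEBUG
/* Emits a trace unless muted. The do/while(0) wrapper keeps the macro
 * safe inside unbraced if/else bodies. */
#define TRACE(...)                                \
	do {                                      \
		if (!trace_muted) {               \
			trace_count++;            \
			fprintf(stderr, __VA_ARGS__); \
		}                                 \
	} while (0)
#else
/* Expands to an empty statement: arguments are never evaluated. */
#define TRACE(...) do { } while (0)
#endif

/* Trace the setup and result, but mute tracing inside the fast loop. */
int reduce_loop(int iters)
{
	int i, sum = 0;

	TRACE("setup: %d iterations\n", iters);
	trace_muted = 1;
	for (i = 0; i < iters; i++) {
		sum += i;
		TRACE("iter %d\n", i); /* suppressed while muted */
	}
	trace_muted = 0;
	TRACE("done: sum=%d\n", sum);
	return sum;
}
```

Because the disabled variant discards its arguments at preprocessing time, trace calls must not carry side effects that the surrounding code depends on.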
Use ofi_hmem_* instead of ze_* specific calls NETCASSINI-4994 Signed-off-by: Chuck Fossen <[email protected]> (cherry picked from commit 782cf95) (cherry picked from commit f9f0a19)
NETCASSINI-4994 Signed-off-by: Chuck Fossen <[email protected]> (cherry picked from commit 25f5ddc) (cherry picked from commit 27d17a7)
Libfabric semantics indicate that fi_cntr_wait() returns if an error count increment occurs before the threshold is reached. NETCASSINI-5909 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 9180ae2) (cherry picked from commit 67fb889)
Adds unit tests for verification of fi_cntr_wait() semantic operation with error count increment. NETCASSINI-5909 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit fa5d0db) (cherry picked from commit f323a5d)
Signed-off-by: James Swaro <[email protected]> (cherry picked from commit 03b283b)
Signed-off-by: James Swaro <[email protected]> (cherry picked from commit 2c0712a)
@jswaro did PR9926 already cherry-pick these over?
Oh, I didn't see that. My mistake. Thanks for closing @j-xiong
This is the same set of changes that went into ofiwg/main earlier today as part of pulling in the changes for RC2.